1 

Sparse/Robust Estimation and Kalman Smoothing with Nonsmooth 
Log-Concave Densities: Modeling, Computation, and Theory 



Aleksandr Y. Aravkin 

IBM T.J. Watson Research Center 
Yorktown, NY 10598 



SARAVKIN@US.IBM.COM 



James V. Burke 

Department of Mathematics, University of Washington 
Seattle, WA, USA 



BURKE@MATH.WASHINGTON.EDU 



Gianluigi Pillonetto 

Department of Information Engineering, University ofPadova 
Padova, Italy 



GIAPI@DEI.UNIPD.IT 



Editor: 



Abstract 



We introduce a class of quadratic support (QS) functions, many of which already play a crucial 
role in a variety of applications, including machine learning, robust statistical inference, sparsity 
promotion, and inverse problems such as Kalman smoothing. Well known examples of QS penal- 
ties include the £2, Huber, l\ and Vapnik losses. This paper builds on a dual representation for 
QS functions using convex analysis. The dual representation reveals the structure necessary for a 
QS function to be interpreted as the negative log of a true probability density, and thus provides 
the foundation for statistical interpretation and analysis of both known and new QS loss functions. 
Moreover, in some cases, it allows for the construction of non-smooth multivariate distributions 
with specified means and variances from simple scalar building blocks. Finally, for a broad sub- 
class of QS loss functions known as piecewise linear quadratic (PLQ) penalties, it is shown how 
the dual representation allows for the development of efficient numerical estimation schemes. The 
main contribution of this paper is a flexible statistical modeling framework for a variety of learn- 
ing applications, together with a toolbox of efficient numerical methods for estimation using these 
densities. In particular, for PLQ densities, we show that interior point (IP) methods can be used. 
IP methods solve nonsmooth optimization problems by working directly with smooth systems of 
equations characterizing the optimality of these problems. The efficiency of the IP approach de- 
pends on the structure of particular applications. We consider the class of dynamic inverse problems 
using Kalman smoothing. This class comprises a wide variety of applications, where the aim is to 
reconstruct the state of a dynamical system with known process and measurement models starting 
from noisy output samples. In the classical case, Gaussian errors are assumed both in the pro- 
cess and measurement models for such problems. The extended framework allows arbitrary PLQ 
densities to be used, and the proposed IP approach solves the generalized Kalman smoothing prob- 
lem while maintaining the linear complexity in the size of the time series, just as in the Gaussian 
case. This extends the computational efficiency of the Mayne-Fraser and Rauch-Tung-Striebel al- 
gorithms to a much broader nonsmooth setting, and includes many recently proposed robust and 
sparse smoothers as special cases. 

Keywords: statistical modeling; convex analysis; nonsmooth optimization; robust inference; spar- 
sity optimization; Kalman smoothing; interior point methods 
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1. Introduction 



Consider the classical problem of Bayesian parametric regression (MacKay 1992 Roweis and 
Ghahramani 1999 1 where the unknown x G M." is a random vectoiQ with a prior distribution speci- 
fied using a known invertible matrix G £ M"*" and known vector fx € K" via 



}l = Gx + w 



(1.1) ? LinearProcess ? 



where w is a zero mean vector with covariance Q. Let z denote a linear transformation of x contam- 
inated with additive zero mean measurement noise v with covariance R, 



Hx + v 



(1.2) ? LinearModel ? 



where H £ I^ x " is a known matrix, while v and w are independent. It is well known that the 
(unconditional) minimum variance linear estimator of x, as a function of z, is the solution to the 
following optimization problem: 



Td-1, 



min (z-HxyR 



Hx) + (ju - Gx) J Q~ l (ji - Gx) . 



(1.3) ?MainObjective? 



As we will show, ( 1.3 ) includes estimation problems arising in discrete-time dynamic linear systems 
which admit a state space representation (Anderson and Moore 1979| Brockett| 1970| ). In this 
context, x is partitioned into N subvectors {x&}, where each x^ represents the hidden system state 
at time instant k. For known data z, the classical Kalman smoother exploits the special structure of 



the matrices H, G, Q and R to compute the solution of ( 1.3 ) in 0(N) operations (Gelb 1974 1. This 



procedure returns the minimum variance estimate of the state sequence {xk\ when the additive noise 
in the system is assumed to be Gaussian. 



In many circumstances, the estimator ( 1.3 ) performs poorly; put another way, quadratic penalization 
on model deviation is a bad model in many situations. For instance, it is not robust with respect 
to the presence of outliers in the data (Huber 1981 Gao 2008 [ Aravkin et al. 201 la[ Farahmand 



et al. 201 1 1 and may have difficulties in reconstructing fast system dynamics, e.g. jumps in the state 
values (Ohlsson et al. 201 1[ ). In addition, sparsity -promoting regularization is often used in order to 
extract a small subset from a large measurement or parameter vector which has greatest impact on 
the predictive capability of the estimate for future data. This sparsity principle permeates many well 
known techniques in machine learning and signal processing, including feature selection, selective 



shrinkage, and compressed sensing (Hastie and Tibshirani 1990, Efron et al. 2004 Donoho 2006). 



In these cases, ( 1.3 1 is often replaced by a more general formulation 



min V(Hx-z;R)+W(Gx-n;Q) 



(1.4) ?probTwo? 



where the loss V may be the ^-norm, the Huber penalty (Huber 1981), Vapnik's e-insensitive 
loss (used in support vector regression (Vapnik, 1998) see also (Hast ie et al.[ 200 1[ |) or the hinge 
loss (leading to support vector classifiers (Evgeniou et aL 2000 Pontil and Verri 1998[|Scholkopf 
et al. 2000)). The regularizer W may be the ^2-norm, the ^i-norm (as in the LASSO (Tibshirani, 



1996)), or a weighted combination of the two, yielding the elastic net procedure (Zou and Hastie 



2005[ ). Many learning algorithms using infinite-dimensional reproducing kernel Hilbert spaces as 
hypothesis spaces ( |Aronszajnj |1950[ |Saitoh| |1988| |Cucker and Smale| |2001[ ) boil down to solving 



1. All vectors are column vectors, unless otherwise specified 
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finite-dimensional problems of the form (1.4) by virtue of the representer theorem (Wahba 1998} 
Scholkopf etal|[2U0T] ). 



These robust and sparse approaches can often be interpreted as placing non-Gaussian priors on w 
(or directly on x) and on the measurement noise v. The Bayesian interpretation of ( |1.4| ) has been 
extensively studied in the statistical and machine learning literature in recent years and probabilistic 
approaches used in the analysis of estimation and learning algorithms can be found e.g. in (Mackay] 
1994[ Tipping| 2001} Wipf et al} 2011 1. Non-Gaussian model errors and priors leading to a great 



variety of loss and penalty functions are also reviewed in (Palmer et al. 2006) using convex-type 
representations, and integral-type variational representations related to Gaussian scale mixtures. 
In contrast to the above approaches, in the first part of the paper, we consider a wide class of 
quadratic support (QS) functions and exploit their dual representation. This class of functions gen- 
eralizes the notion of piecewise linear quadratic (PLQ) penalties Rockafellar and Wets ( 1998 ). The 
dual representation is the key to identifying which QS loss functions can be associated with a den- 
sity, which in turn allows us to interpret the solution to the problem ( 1 .4 ) as a MAP estimator when 
the loss functions V and W come from this subclass of QS penalties. This viewpoint allows statisti- 
cal modeling using non-smooth penalties, such as the l\, hinge, Huber and Vapnik losses, which are 
all PLQ penalties. Identifying a statistical interpretation for this class of problems gives us several 
advantages, including a systematic constructive approach to prescribe mean and variance parame- 
ters for the corresponding model; a property that is particularly important for Kalman smoothing. 
In addition, the dual representation provides the foundation for efficient numerical methods in esti- 
mation based on interior point optimization technology. In the second part of the paper, we derive 
the Karush-Kuhn-Tucker (KKT) equations for problem ( 1.4 1, and introduce interior point (IP) meth- 
ods, which are iterative methods to solve the KKT equations using smooth approximations. This is 
essentially a smoothing approach to many (non-smooth) robust and sparse problems of interest to 
practitioners. Furthermore, we provide conditions under which the IP methods solve (1.4) when V 
and W come from PLQ densities, and describe implementation details for the entire class. 
A concerted research effort has recently focused on the solution of regularized large-scale inverse 
and learning problems, where computational costs and memory limitations are critical. This class 
of problems includes the popular kernel-based methods ( |Rasmussen and Williams| [2006| |Scholkopf 
and Smola}|2001[|Smola and Scholkopf, 2003), coordinate descent methods ( Tseng and Yun[ 2008 [ 



Lucidi et"aLj 2007 Dinuzzo||2011[ ) and decomposition techniques ( |Joachims 1998 Lin 200 1| Lu- 



cidi et al. 2007 1, one of which is the widely used sequential minimal optimization algorithm for sup- 



port vector machines (Piatt 1998[ ). Other techniques are based on kernel approximations, e.g. using 
incomplete Cholesky factorization (Fine and Scheinberg, 2001), approximate eigen-decomposition 
(Zhang a nd Kwok||2010[ ) or truncated spectral representations ( |Pillonetto and Bell}|2007| ). Efficient 
interior point methods have been developed for £\ -regularized problems ( |Kim et al] |2007| ), and for 
support vector machines (|Ferris and Munson)|2003[). 



In contrast, general and efficient solvers for state space estimation problems of the form ( 1 .4 1 are 
missing in the literature. The last part of this paper provides a contribution to fill this gap, spe- 
cializing the general results to the dynamic case, and recovering the classical efficiency results of 
the least-squares formulation. In particular, we design new Kalman smoothers tailored for systems 
subject to noises coming from PLQ densities. Amazingly, it turns out that the IP method used in 
( Aravkin et al. 201 la I generalizes perfectly to the entire class of PLQ densities under a simple ver- 
ifiable non-degeneracy condition. In practice, IP methods converge in a small number of iterations, 
and the effort per iteration depends on the structure of the underlying problem. We show that the IP 



3 



iterations for all PLQ Kalman smoothing problems can be computed with a number of operations 
that scales linearly in N, as in the quadratic case. This theoretical foundation generalizes the results 
recently obtained in ( Aravkin et al.[ 2011a|b Farahmand e t al.[ 201 1 Ohlsson et al. 201 1 1, framing 
them as particular cases of the general framework presented here. 

The paper is organized as follows. In Section|2]we introduce the class of QS convex functions, and 
give sufficient conditions that allow us to interpret these functions as the negative logs of associated 
probability densities. In Section[3]we show how to construct QS penalties and densities having a de- 
sired structure from basic components, and in particular how multivariate densities can be endowed 
with prescribed means and variances using scalar building blocks. To illustrates this procedure, 
further details are provided for the Huber and Vapnik penalties. In Section [4] we focus on PLQ 
penalties, derive the associated KKT system, and present a theorem that guarantees convergence of 
IP methods under appropriate hypotheses. In Section [5] we present the Kalman smoothing dynamic 
model, formulate Kalman smoothing with PLQ penalties, present the KKT system for the dynamic 
case, and show that IP iterations for PLQ smoothing preserve the classical computational efficiency 
known for the Gaussian case. We present numerical examples using both simulated and real data in 
Section [6] and make some concluding remarks in Section [7] Section [8] serves as an appendix where 
supporting mathematical results and proofs are presented. 



2. Quadratic Support Functions and Densities 

In this section, we introduce the class of Quadratic Support (QS) functions, characterize some of 
their properties, and show that many commonly used penalties fall into this class. We also give a 
statistical interpretation to QS penalties by interpreting them as negative log likelihoods of prob- 
ability densities; this relationship allows prescribing means and variances along with the general 
quality of the error model, an essential requirement of the Kalman smoothing framework and many 
other areas. 



2.1 Preliminaries 

We recall a few definitions from convex analysis, required to specify the domains of QS penalties. 
The reader is referred to ( Rockafellar| 1970[ Rockafellar and Wets||1998( ) for more detailed reading. 

• (Affine hull) Define the affme hull of any set C C W 1 , denoted by aff(C), as the smallest affine 
set (translated subspace) that contains C. 

• (Cone) For any set C C W, denote by cone C the set {tr\r G C,t G 

• (Domain) For f(x) : W 1 -)• I = {MU°°}, dom(/) = {x : f(x) < °°}. 

• (Polars of convex sets) For any convex set C C M m , the polar of C is defined to be 

C° :={r\(r,d) <lVJeC}, 
and if C is a convex cone, this representation is equivalent to 

C° :={r\(r,d) <0VdGC}. 

• (Horizon cone). Let C C R" be a nonempty convex set. The horizon cone C°° is the convex 
cone of 'unbounded directions' for C, i.e. d G C°° if C + d C C. 
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(Barrier cone). The barrier cone of a convex set C is denoted by bar(C): 
bar(C) := {x*|for some J8 G M, (x, x*) < J8 Vx G C} . 

(Support function). The support function for a set C is denoted by 8* (x \ C ) : 



5*(x|C) := sup(x, c) . 

ceC 



2.2 QS functions and densities 



We now introduce the QS functions and associated densities that are the focus of this paper. We 
begin with the dual representation, which is crucial to both establishing a statistical interpretation 
and to the development of a computational framework. 

Definition 1 (Quadratic Support functions and penalties) A QS function is any function p(U,M,b,B; •) : 
MP — > M. having representation 

p(U,M,b,B;y) = sup {(u,b + By) - \{u,Mu)} , (2.1)?PLQpenaity? 
ueU 

where U C R m is a nonempty convex set, M G J?" the set of real symmetric positive semidefinite 
matrices, and b + By is an injective affine transformation in y, with B G M mxn , so, in particular, 
m<n and null(B) = {0}. 

When G U, we refer to the associated QS function as a penalty, since it is necessarily non- 
negative. 

Remark 2 When U is polyhedral, G U, b = and B = I, we recover the basic piecewise linear- 
quadratic penalties characterized in (Rockafellar and Wets 1998\ Example 11.18). 



Theorem 3 Let U,M,B,bbe as in Definition^ and set K = U°° nnull(M). Then 

B- l [bai(U) +Rm(M) - b] c dom\p(U,M,B,b; ■)] cB- l [K°-b] , 

with equality throughout when bai(U) + Ran (M) is closed, where bar(U) = dom (8* (• | U)) is the 
barrier cone of U. In particular, equality always holds when U is polyhedral. 

We now show that many commonly used penalties are special cases of QS (and indeed, of the 
PLQ) class. 

Remark 4 (scalar examples) £2, i\, elastic net, Huber, hinge, and Vapnik penalties are all repre- 
sentable using the notation of Definition^ 

1. £ 2 : Take U = M, M = 1, b = 0, and B = 1. We obtain 

p(y) = sup {uy — u 2 / 2} . 

The function inside the sup is maximized at u = y, hence p(y) = \y 2 , see top left panel of 
Fig.® 
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Figure 1: Scalar £2 (top left), l\ (top right), Huber (middle left), Vapnik (middle right), elastic net 
(bottom left) and smooth insensitive loss (bottom right) penalties 



2. l x : Take U = [-1,1], M = 0, b = 0, and B = 1. We obtain 



P(y) = sup {uy} . 

bG[-1,1] 

The function inside the sup is maximized by taking u = sign(y), hence p(y) = \y\, see top right 
panel of Fig. [7] 



3. Elastic net: li-\-Xl\. Take 

£/ = Mx [-X,X\, b 



M 



i 





B 



This construction reveals the general calculus of PLQ addition, see Remark [5] See bottom 
right panel of Fig. [7] 



4. Huber: Take U 



■K, k], M = 1, b = 0, and B = 1. We obtain 
p(y) = sup {uy-u 2 /2} , 



1 „2 
2 K ■ 



with three explicit cases: 

(a) Ify < —K, take u = —K to obtain —Ky- 

(b) If — K <y < K, take u = y to obtain ^y 2 . 

(c) Ify > K, take u = K to obtain a contribution of Ky — ~ K 2 . 

This is the Huber penalty, shown in the middle left panel of Fig. [7J 
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5. Hinge loss: Taking B = 1, b = —s,M = and U = [0, 1] we have 



p(y) = sup{(y-e)u} = (y-e)+. 

ueU 

To verify this, just note that ify < s, u* = 0; otherwise u* = 1. 

6. Vapnik loss is given by (y — e) + + (— y — e) + . We immediately obtain its PLQ representation 
by taking 



B 



' 1 " 




£ 


,M = 





0" 




,b = - 




-1 




£ 









, 17 = [0,1] x [0,1] 



to yield 



p(y) = sup 

ueu 



y-e 
-y-£ 



= Cv- £)+ + (-?-£)+• 

r/ze Vapnik penalty is shown in the middle right panel of Fig. [7] 



7. hinge loss function (ChueFaL^ 2001 ). Combining ideas from examples^and^ we can 
construct a 'soft' hinge loss; i.e. the function 



P(y) 



o 



if y<£ 
k K:(3;-e)-i(K:) 2 e + K<y. 



\(y-e) 2 



that has a smooth (quadratic) transition rather than a kink at £ : Taking B=\,b= —£, M = 1 
and U = [0, ff] we have 



P(y)= sup {{y-£)u} 
ue[o,K] 



1„2 



To verify this function has the explicit representation given above, note that ify < £, u* = 0; 
if £ < y < K + £, we have u* = (y — e) + , and ifx + £ <y, we have u* = K. 



8. Soft insensitive loss function < \Chu et aL\ \2001 \. Using example^ we can create a symmetric 
soft insensitive loss function (which one might term the Hubnik) by adding together to soft 
hinge loss functions: 

p(y)= sup {(y — £)u} — \u 2 + sup {(— y — £)u} — \u 2 

«efo,)cl 



ue[o,K] 
sup 



y-£ 
-y-£ 



1 T 

\u 



1 
1 



ue[o,K] 2 

See bottom bottom right panel of Fig. [7] 

Note that the affine generalization (Definition[T]) is needed to form the elastic net, the Vapnik penalty, 
and the SILF function, as all of these are sums of simpler QS penalties. These sum constructions are 
examples of a general calculus which allows the modeler to build up a QS density having a desired 
structure. This calculus is described in the following remark. 
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Remark 5 Let Pi(y) and Pziy) be two QS penalties specified by Uj,Mj,bj,Bj, for i = 1,2. Then the 
sum p (y) = pi (y) + Pz(y) is also a QS penalty, with 



"Mi 


" 


,b = 


V 


,B = 







M 2 






B 2_ 



U = U l xU 2 ,M 



Notwithstanding the catalogue of scalar QS functions in Remark 4 and the gluing procedure de- 
scribed in Remark [5] the supremum in Definition [T] appears to be a significant roadblock to under- 
standing and designing a QS function having specific properties. However, with some practice the 
design of QS penalties is not as daunting a task as it first appears. A key tool in understanding the 
structure of QS functions are Euclidean norm projections onto convex sets. 



Theorem 6 (Projection Theorem for Convex Sets) ftlarantonello ( 1971 )] Let Q £ W ixn be sym- 
metric and positive definite and let C C K be non-empty, closed and convex. Then Q defines and 

inner product on M." by (x, t)g = x T Qy with associated Euclidean norm \\x\\q = w (x, x)q. The 



projection of a point yGR" onto C in norm \\ ■ \\q is the unique point | C) solving the least 
distance problem 

inf||v— x\\q, (2.2)? proj dist ? 

xeC 



and z, = Pq (y \ C ) if and only if z, G C and 

{x — z,y — z) Q <0 VieC. 



(2.3)?proj oc? 



Note that the least distance problem (2.2 1 is equivalent to the problem 



inf h\\y — x\\ 



XEC 



In the following lemma we use projections as well as duality theory to provide alternative represen- 
tations for QS penalties. 



nnxk 



any 



Theorem 7 Let M € M" x " be symmetric and positive semi-definite matrix, let L £ 
matrix satisfying M = LL T where k = rank(M), and let U cK" be a non-empty, closed and convex 
set that contains the origin. Then the QS function p := p(U,M,0,I; •) has the primal representations 



p{y) = inf \l\\s\$ + S*(y-Ls\U)] = inf \ U\s\\ 2 2 + y (y ~ Ls \ U°) 

seR k iSR* 

where, for any convex set V, 

8*(z\V) :=sup(z,v) and y(z \ V ) := inf {t \ t > 0, z G tV } 

vev 



are the support and gauge functionals for V, respectively. 

If it is further assumed that M £ ^ 7 I ! + the set of positive definite matrices, then p has the represen- 



(2.4) ?QS primal? 
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tations 



p(y) = inf[\\\s\\ 2 M + y{M- l y-s\M- l U°)} (2.5) {?} 

= \\\P M {M- l y\U)\\ 2 M + y{M- { y-P M {M- l y\U)\M- l U ) (2.6) {?} 

= inf ;[i |M&_,+ 70 -J I U°)] (2.7) {?} 

= \\\P M -i(y\MU)f M ^+y(y-P M -,(y\MU)\U ) (2.8){?} 

= i/M-V-Millw-M-VllM (2.9) {?} 

uEU 

= \\\P M {M- x y\U)\\ 2 M + {M- { y-P M {M- l y\U),P M {M- l y\U)) M (2.10) {?} 

= i/M-V- inf illv-yll^ (2.11){?} 

vEMU 

= \\\P M Ay\MU)\\ 2 M - i+ (y-P M Ay\MU),P M Ay\MU)) M ^ . (2.12){?} 
/n particular, ( |2. 1 1[ > says p(y) = \^y T M^ x y whenever y 6 Mt/. A/so note frtaf, ( ]2.4[ ), one can 



replace the gauge junctionals in (2.5 >-( [2T8| > oy ?/ze support functional of the appropriate set where 
M- l U° = (MU)°. 



The formulas ( |2.5| )-( |27T2| ) show how one can build PLQ penalties having a wide range of desir- 
able properties. We now give a short list of a few examples illustrating how to make use of these 
representations. 

Remark 8 (General examples) In this remark we show how the representations in Lemma [7] can 
be used to build QS penalties with specific structure. In each example we specify the components 
U,M,b, and B for the QS function p := p(U,M,b,B;-). 

1. Norms. Any norm || • || can be represented as a QS function by taking M = 0, B = I, b = 0, 
U = B°, where B is the unit ball of the desired norm. Then, by (|2.4[), p (y) = \\y || = y (y | B ). 



2. Gauges and support functions. Let U be any closed convex set containing the origin, and 
Take M = 0,B = I,b = 0. Then, by p{y) = y(y\U°) = 8* (y\U). 

3. Generalized Huber functions. Take any norm || • || having closed unit ball B. Let M E 
B = I, b = 0, and U = B°. Then, by the representation (|2.8[), 



P(y) = ^PM-'iylMM^M-'P^iylMM^ + Wy-P^iylMM^W . (2.13)? g-Huber ? 
In particular, for y G MB°, p(y) = \y T M~ l y. 

If we take M = 1 and \\-\\ = K~ l \\ - \\ifar K>0(i.e. U = kB^ and U° = K^BJ, then p is the 
multivariate Huber function described in item 4 of Remark^ In this way, Theorem^shows 
how to generalize the essence of the Huber norm to any choice of norm. For example, if we 
take U = kEm = {ku \ \\u\\m < 1 }, then, by ( |2.10| ), 



P(y) 



i\\y\\ 2 M -i ,if IblU-i < « 
k|MIm-i-t ^IMIm-' > * ■ 
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4. Generalized hinge-loss functions. Let || • || be a norm with closed unit ball B, let K be a 
non-empty closed convex cone in W 1 , and let v € W. Set M = 0, b = —v, B = I, and U = 
-(WDK°) = B° n {-K)°. Then, by $urke\\l987\ Section 2), 



p(y) = dist(y|v — K) = inf \\y — b + u\\ . 

If we consider the order structure "<k" induced on W by 

y< K v <^> v-y£K, 

then p (y) = if and only ify v. By taking \\ ■ \\ = \\ ■ || i, K = W\ so (—K)° = K, and v = el, 
where 1 is the vector of all ones, we recover the multivariate hinge loss function in Remark^ 

5. Order intervals and Vapnik loss functions. Let \\ ■ \\ be a norm with closed unit ball B, let 
K CM" be a non-empty symmetric convex cone in the sense that K° = —K, and let w <k V, 
or equivalently, v — w G m\x(K). Set 



U 



'(IK) x 



M 








and B 



Then 



Observe that p(y) 



p(y) = dist(>| v-K) + dist (y \ w + K ] 



or- 



der interval" (Schaefer 1970). If we take w 



if and only if w <k y <k v. The set {y \ w <k y <k v } is an 

-v, then {y \ —v <k y <k v} is a symmetric 
neighborhood of the origin. By taking || • || = || • K = R", and v = el=-w, we recover the 
multivariate Vapnik loss fu nction in Remark^ Further examples of symmetric cones are 
and the Lorentz, or £2 cone ( Guler and Hauser 2002 ). 



The examples given above show that one can also construct generalized versions of the elastic net 
as well as the soft insensitive loss functions defined in Remark |4] In addition, cone constraints can 
also be added by using the identity 8* (• | K° ) = 8 (• | K ). These examples serve to illustrate the wide 
variety of penalty functions representable as QS functions. Computationally, one is only limited by 
the ability to compute projections described in Theorem[7] Further computational properties for QS 
functions are described in ( |Aravkin et al.|[20T2l Section 6). 

In order to characterize QS functions as negative logs of density functions, we need to ensure the 
integrability of said density functions. The function p(y) is said to be coercive if lim|| 3 ,||^ 00 p(y) = 00, 
and coercivity turns out to be the key property to ensure integrability. The proof of this fact and 
the characterization of coercivity for QS functions are the subject of the next two theorems (see 
Appendix for proofs). 

Theorem 9 (QS integrability) Suppose p{y) is a coercive QS penalty. Then the function exp[— p(y)] 
is integrable on aff[dom(p)] with respect to the dim(aff[dom(p)])-dimensional Lebesgue measure. 

Theorem 10 A QS function p is coercive if and only if [B T cone(U)}° = {0}. 



B 



Theorem 10 can be used to show the coercivity of familiar penalties. In particular, note that if 
= 7, then the QS function is coercive if and only if U contains the origin in its interior. 
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Corollary 11 The penalties £2, l\, elastic net, Vapnik, and Huber are all coercive. 



Proof We show that all of these penalties satisfy the hypothesis of Theorem 10 
£ 2 : U = R and 5 = 1, so [B T cone(£/)]° = R° = {0}. 

£1: 17 = [-1,1], so cone (£/) =M, and fi = 1. 
Elastic Net: In this case, cone(£/) = R 2 and B = [ \ ] . 
Huber: U = [— K, k], so cone(£/) = R, and B = 1. 

Vapnik: £/ = [0, 1] x [0, 1], so cone(£/) =K^_. B = [Ji], so fi T cone(£/) = M. 

■ 

One can also show the coercivity of the above examples using their primal representations. How- 
ever, our main objective is to pave the way for a modeling framework where multi-dimensional 
penalties can be constructed from simple building blocks and then solved by a uniform approach, 
using the dual representations alone. 

We now define a family of distributions on W by interpreting piecewise linear quadratic func- 
tions p as negative logs of corresponding densities. Note that the support of the distributions is 
always contained in dom p, characterized, in Theorem [3] 

Definition 12 (QS densities) Let p(U,M,B,b;y) be any coercive extended QS penalty on W 1 . De- 
fine p(y) to be the following density on W 1 : 

, . I c _1 exp [— p(y)] yGdomp 
p(y)={ FL FyJn y F (2.14)?PLQdensity? 



where 



else, 



/ exp[-p(y)]dy 

'yGdom p 



and the integral is with respect to the dim(dom(p )) -dimensional Lebesgue measure. 

QS densities are true densities on the affine hull of the domain of p. The proof of Theorem [9] 
can be easily adapted to show that they have moments of all orders. 



3. Constructing QS densities 



In this section, we describe how to construct multivariate QS densities with prescribed means and 
variances. We show how to compute normalization constants to obtain scalar densities, and then 
extend to multivariate densities using linear transformations. Finally, we show how to obtain the data 
structures U,M,B,b corresponding to multivariate densities, since these are used by the optimization 
approach in Section |4j 

We make use of the following definitions. Given a sequence of column vectors {r^} = {r\ , . . . , r^} 
and matrices {L^} = {Ej , . . . , IW}, we use the notation 



vec({r k }) 



rN 



, diag({£ )t }) 



El 






£2 









Ejv 
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In definition [T2l QS densities are defined over W. The moments of these densities depend in 



a nontrivial way on the choice of parameters b,B,U,M. In practice, we would like to be able to 
construct these densities to have prescribed means and variances. We now show how this can be 
done using scalar QS random variables as the building blocks. Suppose y = vec ({;%}) is a vector 
of independent (but not necessarily identical) QS random variables with mean and variance 1. 
Denote by bk,Bk, Uu,M k the specification for the densities of y^. To obtain the density of y, we need 
only take 

U = Ui x U 2 x • • • x U N 
M = diag({M k }) 
B = diag({B,}) 
b = vec({b k }) . 



M 



I, b = 0, B = I, while 
-1,1]", M = 0, b = 0, 



For example, the standard Gaussian distribution is specified by U = 
the standard £i-Laplace (see (Aravkin et al. 2011a)) is specified t 
B = V2I. 

The random vector y = Q 1 ^ 2 (y+H) has mean ji and variance Q. If c is the normalizing constant for 
the density of y, then cdet(2) 1 / 2 is the normalizing constant for the density of y. 



Remark 13 Note that only independence of the building blocks is required in the above result. 
This allows the flexibility to impose different QS densities on different errors in the model. Such 
flexibility may be useful for example when combining measurement data from different instruments, 
where some instruments may occasionally give bad data (with outliers), while others have errors 
that are modeled well by Gaussian distributions. 

We now show how to construct scalar building blocks with mean and variance 1, i.e. how to 
compute the key normalizing constants for any QS penalty. To this aim, suppose p (y) is a scalar QS 
penalty that is symmetric about 0. We would like to construct a density p(j) = exp [— p(ciy)] jc\ to 
be a true density with unit variance, that is, 

— f exp [— p (ciy)] dy = 1 and — [y 2 exp [— p(c2y)]dy = 1, (3.1)?PLQconst? 
ci J C\ J 

where the integrals are over R. Using ^-substitution, these equations become 

CiC 2 = J exp[-p(y)]dy and c\c\ = J J 2 exp [-p(y)] dy. 
Solving this system yields 



ci = J j y 2 exp [-p Cy)] dy I J exp [-p (y)} dy 
ci = — ( exp[-p(y))dy . 

C2 J 

These expressions can be used to obtain the normalizing constants for any particular p using simple 
integrals. 
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3.1 Huber Density 

The scalar density corresponding to the Huber penalty is constructed as follows. Set 

PhM = — exp[-p H (c 2 y)] , 
where c\ and c 2 are chosen as in (3.1 ). Specifically, we compute 

j exp[-p ff (y)] dy = 2exp [-K 2 /2\ - + V2k[2®(k) - 1] 
Jy 2 exp[- PH (y)}dy = 4exp[-K 2 /2} ] -^ + V2n[2<l>(K)-\} , 



(3.2) ? HuberDensityScal 



where <J> is the standard normal cumulative density function. The constants c\ and c 2 can now be 
readily computed. 

To obtain the multivariate Huber density with variance Q and mean pL, let U = [— K, k]'\ M = I, 
B = I any full rank matrix, and b = 0. This gives the desired density: 



Ph(j) 



1 



c' 1 , det(2 1 /2) 



exp 



-sup<j(c 2 e 1/2 {y-n),u)-^u l u 
ueu I v ll 



1 



(3.3) ?HuberDensity? 



3.2 Vapnik Density 

The scalar density associated with the Vapnik penalty is constructed as follows. Set 

PvO) = — exp[-p v (c 2 y)] , 



(3.4) ?VapnikDensitySce 



where the normalizing constants c\ and c 2 can be obtained from 

jexp[- Pv (y)]dy = 2{e + l) 

f 2 

J y 2 exp[-p v (y)]dy = -e 3 +2(2 - 2e + e 2 ), 

using the results in Section[3] Taking U = [0, l] 2 ", the multivariate Vapnik distribution with mean /I 
and variance Q is 



pv(y) 



^det(Q 1 /2) 



exp 



- sup { (c 2 BQ- l l 2 (y - n) - el 2n ,u) } 



(3.5) ? VapnikDensity ? 



where B is block diagonal with each block of the form B = [_\] , and 1 2 „ is a column vector of l's 
of length 2n. 

4. Optimization with PLQ penalties 

In the previous sections, QS penalties were characterized using their dual representation and inter- 
preted as negative log likelihoods of true densities. As we have seen, the scope of such densities is 
extremely broad. Moreover, these densities can easily be constructed to possess specified moment 
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properties. In this section, we expand on their utility by showing that the resulting estimation prob- 



lems ( |1.4[ ) can be solved with high accuracy using standard techniques from numerical optimization 
for a large subclass of these penalties. We focus on PLQ penalties for the sake of simplicity in 
our presentation of an interior point approach to solving these estimation problems. However, the 
interior point approach applies in much more general settings, e.g. see Nemirovskii and Nesterov 
( 1994). Nonetheless, the PLQ case is sufficient to cover all of the examples given in Remark|4] while 
giving the flavor of how to proceed in the more general cases. 



We exploit the dual representation for the class of PLQ penalties ( Rockafellar and Wets| 1998] ) 
to explicitly construct the Karush-Kuhn-Tucker (KKT) conditions for a wide variety of model prob- 
lems of the form (|1.4[). Working with these systems opens the door to using a wide variety of 



numerical methods for convex quadratic programming to solve ( 1.4 ). 

Let p(U v ,M v ,b v ,B v ;y) and p(U w ,M w ,b w ,B w ;y) be two PLQ penalties and define 



and 



Then ( 1 .4 1 becomes 



V(v;R) := p{U Vl M v ,b v ,B v ;R- l/2 \ 



W(w;Q) :=p{U w ,M w ,b w ,B w ;Q- 1 / 2 w). 



rmnp(U,M,b,B;y), 



(4.1) ? PLQ-V ? 

(4.2) ? PLQ-W ? 

(4.3) ?newopt? 



where 



and 



U :=U v x U w , M :-- 



~M V 


" 












'by-ByR-Vh 
J> w -B W Q-V 2 yi 



B:- 



B V R l l 2 H 
B W Q l ' 2 G 



Moreover, the hypotheses in ( |1.1| >, ( |1.2[ ), ( |1.4| >, and ( |2.1[ ) imply that the matrix B in ( |4.3| ) is injective. 
Indeed, By = if and only if B w Q~ 1 ' 2 Gy = 0, but, since G is nonsingular and B w is injective, this 
implies that y = 0. That is, nul(fi) = {0}. Consequently, the objective in ( |4.3[ ) takes the form of a 
PLQ penalty function ( |2.1[ ). In particular, if ( |4.1| ) and ( |4.2| ) arise from PLQ densities (definition [12]), 
then the solution to problem ( |4.3| ) is the MAP estimator in the statistical model ( |l.l[ )-( fL"2"| ). 

To simplify the notational burden, in the remainder of this section we work with (|4.3|> directly 



and assume that the defining objects in ( |4.3| ) have the dimensions specified in ( |2.1[ ); 

U G M. m , M G R mxm , b e R m , and B € R mxn . (4.4) ? PLQdimS P ecs ? 

The Lagrangian ( Rockafellar and Wets| 1998) [Example 11.47] for problem (4.3 ) is given by 



L(y, u) = b T u — -u T Mu + u J By . 
By assumption U is polyhedral, and so can be specified to take the form 

U = {u : A J u < a} , 



(4.5) ?polyhedralU? 
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where A G 



Using this reprsentation for U, the optimality conditions for (4.3 1 (Rockafellar 



1970 Rockafellar and Wets 1998 1 are 







B J u 



= b + By-Mu-Aq 

= A T u + s-a 

= qjSi , i = 1 , . . . , I , q, s > 



(4.6) ? fullKKT ? 



where the non-negative slack variable s is defined by the third equation in ( |4.6| >. The non-negativity 
of s implies that u G U. The equations = qiSi , i = 1 , . . . ,£ in (|4.6[> are known as the complemen- 



tarity conditions. By convexity, solving the problem ( |4.3| ) is equivalent to satisfying ( |4.6[ ). There 
is a vast optimization literature on working directly with the KKT system. In particular, interior 
point (IP) methods (Kojima et al. 1991 Nemirovskii and Nesterov| 1994, Wright, 1997| ) can be 
employed. In the Kalman filtering/smoothing application, IP methods have been used to solve the 
KKT system (4.6) in a numerically stable and efficient manner, see e.g. ( Aravkin et al.| 201 lb I. 
Remarkably, the IP approach used in ( |Aravkin et al.| 201 lb| ) generalizes to the entire PLQ class. 
For Kalman filtering and smoothing, the computational efficiency is also preserved (see Section [5] 
Here, we show the general development for the entire PLQ class using standard techniques from the 
IP literature. 

Let U,M,b,B, and A be as defined in ( |2.1| ) and ((43), and let T £ (0,+°°]- We define the x slice 
of the strict feasibility region for (4.6) to be the set 



< s, < q, s T q < T, and 



(s,q,u,y) satisfy the affine equations in ( |4.6 1 



and the central path for ( |4.6[ ) to be the set 

< s, < q, y = qiSi i= \ and 



(s,q,u,y) 



(s,q,u,y) satisfy the affine equations in ( |4.6| ) 



For simplicity, we define := j£" + (+oo). The basic strategy of a primal-dual IP method is to 
follow the central path to a solution of ( |4.6| ) as y J, by applying a predictor-corrector damped 
Newton method to the function mapping R e x M. £ x R" 1 x W to itself given by 



F r (s,q,u,y) 

where D(q) and D(s) are diagonal matrices with vectors q,s on the diagonal. 



s+A T u ■ 



D{q)D(s)\-y\ 
By-Mu-Aq + b 



B T u 



(4.7) ? rKKT ? 



Theorem 14 Let U,M,b,B, and A be as defined in ( |2.1| ) and ((43). Given T > 0, let . ^+(t), 
and & be as defined above. If 



^ and null(M) n null(A T ) 
then the following statements hold. 



{0}, 



(4.8)? key IP ? 
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(i) Fy l \s,q,u,y) is invertible for all (s,q,u,y) G 



(ii) Define #+ = {(s,q) \ 3 (u,y) G M m x W s.t. (s,q,u,y) G #+ }. 7ften for each (s,q) G 
there exists a unique (u,y) G W n x M" smc/i that (s,q,u,y) G J^+. 

(m) The set &+(t) is bounded for every T > 0. 

( iv) For every g G there is a unique (s, q, u,y) G such that g = (s\q\ , S2q2, S£<li) T - 

(v) For every y> 0, there is a unique solution [s(Y),q(Y),u(Y),y(y)] to the equation Fy(s,q,u,y) = 
0. Moreover, these points form a differentiable trajectory in M v xR v x M m x M". Zrc particular, 
we may write 

v = {[s(r),q(r)Ar),y(r)] \r>o} . 



fv/j 77ze sef of cluster points of the central path as y\, « non-empty, and every such cluster point 
is a solution to (14. 6b. 



Please see the Appendix for proof. Theorem 14 shows that if the conditions ( |4.8[ ) hold, then IP 
techniques can be applied to solve the problem (43 1. In all of the applications we consider, the 
condition null(M) nnull(A T ) = {0} is easily verified. For example, in the setting of (4.3 1 with 



U v = {u | A v u < a v } and U w = {u\ A w u < b w } 



(4.9) ? uvuw ? 



this condition reduces to 



null(M v ) nnuU(Aj) = {0} and null(M w ) nnull(A^) = {0}. 



(4.10) ? NullProperty ? 



Corollary 15 The densities corresponding to l\ , £2, Huber, and Vapnik penalties all satisfy hypoth- 
esis ( |4 .IP! ). 

Proof We verify that null(M) n null(A T ) = for each of the four penalties. In the £2 case, M has 
full rank. For the £\, Huber, and Vapnik penalties, the respective sets U are bounded, so U°° = {0}. 
■ 

On the other hand, the condition 7^ is typically more difficult to verify. We show how this 
is done for two sample cases from class ( |1.4| >, where the non-emptiness of ^ + is established by 
constructing an element of this set. Such constructed points are useful for initializing the interior 
point algorithm. 

4.1 li-£ 2 : 

Suppose V(y;R) = [[R^^v^ and W(w;Q) = \ \(r xll w\\ r In this case 

Uv — [ l«u Ira]; M v — mX Mi) b v — m , B v — Imxni) 
U w = K , M w = Inxni b w = n , B w = l n xni 
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and R G K mxm and Q G W ixn are symmetric positive definite covariance matrices. Following the 
notation of (14.31) we have 



U = [—1,1] x R n , M ■ 



0„ 







B 



R-^H 
Q 1/2 G 



The specification of U in (4.5 ) is given by 



Imxm 0«x« 
~Imxm Onxn 



and a 
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to 



Clearly, the condition null(M) nnull(A T ) = {0} in ( |4.8| > is satisfied. Hence, for Theorem 
apply, we need only check that J£~ + ^ 0. This is easily established by noting that (s,q,u,y) G J^+, 
where 

where, for g G R e , g + is defined componentwise by = max{g,-,0} and g_rp, = min{g,-,0}. 
4.2 Vapnik - Huber: 



Suppose that V(v;R) and W(w;Q) are as in ( |4.1| ) and ( |4.2| ), respectively, with V a Vapnik penalty 
and W a Huber penalty: 



£l m 



el« 



U v = [0,l m ] X [0,l m ], M r = 2m x2m, = 

Uw = [ fin, fin]) M w = Inxm t>w = 0«, B\v = Inxn t 

and R G R mxm and 2 G M nx " are symmetric positive definite covariance matrices. Following the 



notation of ( |4.3| > we have 

U = ([0,1m] x [0,l w ]) x [-fCl„,fCl„], M 
^l OT + /?-V2^ 

el m -/? ~ 1/2 z 1,5 



1/2, 



02m x 2m 

4x1 

R l ' 2 H 
-R l l 2 H 
Q 1/2 G 



The specification of £/ in ( |4.5[ ) is given by 



hnxm 





" 




/lm\ 


~lmxm 










0m 





Imxm 





and a = 


1m 





Imxm 





0m 








J-nxn 




JCl„ 








Inxn_ 







Since null(A T ) = {0}, the condition null (M) n null (A T ) = {0} in ( |4T8| ) is satisfied. Hence, for 
Theorem [14] to apply, we need only check that ^ 0. We establish this by constructing an 
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element (s,q,u,y) of For this, let 





AA 




A?A 


u= \u 2 \,s = 

w 








S3 




43 


S 4 

w 


. ? = 


<?4 
<?5 



and set 

y = n , U\=U 2 = jl£, U 3 = n , Si = S 2 = S3 = S 4 = \\(, S 5 = S 6 = Kl n , 

and 

qi =l m -{el m + R- [/2 z)-, q 2 = l m + (el m +R- 1/2 z)+, 
q 3 = l m - {el m -R- [/2 z)-, q 4 = l m + (el m -R- 1/2 z)+, 

q 5 = ln- (<2~ 1/2 Ai)-, 16 = In + (Q- l/2 ll)+ ■ 

Then (s,q,u,y) € 



5. Kalman Smoothing with PLQ penalties 

Consider now a dynamic scenario, where the system state x k evolves according to the following 
stochastic discrete-time linear model 

X\ =Xq+Wi 

*k = G k X k -\+W k , k = 2,3,...,N (5.1)? LinearGaussModel 
Zk = H k x k + v k , k=l,2,...,N 

where xq is known, z k is the m-dimensional subvector of z containing the noisy output samples 
collected at instant k, G k and H k are known matrices. Further, we consider the general case where 
{w k } and {v k } are mutually independent zero-mean random variables which can come from any of 
the densities introduced in the previous section, with positive definite covariance matrices denoted 
by {Qk} and {R k }, respectively. 

In order to formulate the Kalman smoothing problem over the entire sequence {x k }, define 

x = vecjxi, • • • ,xn} , w = vecjwi, • • • ,w^} 

v = vec{vi,--- ,v N } , <2 = diag{<2i, ■■ ,Q N } 

/? = diag{/?i, -- ,R N } , H = dmg{H u -- ,H N }, 



and 



I 

-G 2 I 



'•• 
-G N I 
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Then model (5T I can be written in the form of ( | 1 . 1 | >-( pT2] >, i.e., 



jj. = Gx + w 
z = Hx + v , 



(5.2) ? fullStat ? 



where x G W is the entire state sequence of interest, w is corresponding process noise, z is the 
vector of all measurements, v is the measurement noise, and /I is a vector of size nN with the 
first n-block equal to xq, the initial state estimate, and the other blocks set to 0. This is precisely 
the problem ( |1.1| )-( [L~2"] ) that began our study. The problem ( |1.3| > becomes the classical Kalman 
smoothing problem with quadratic penalties. When the penalties arise from general PLQ densities, 
the general Kalman smoothing problem takes the form ( |4.3[ ), studied in the previous section. The 
details are provided in the following remark. 



Remark 16 Suppose that the noises w and v in the model (5.2i are PLQ densities with means 0, 
variances Q and R (see Def. 12). Then, for suitable U w ,M w ,b w ,B w and U v ,M v ,b v ,B v and corre- 
sponding p w and p v we have 



p(w) oc exp 



p(u w ,M w ,b w ,B w ;Q-V 2 



xv 



p(v) oc exp \-p(U v ,M v ,b v ,B v ;R- 1/2 v) 



(5.3) ? kalmanDensities ? 



while the MAP estimator of x in the model ( 5.2 ) is 



argmin ■ 



U w ,M w ,b w ,B w ;Q-^ 2 (Gx- n 



U v ,M v ,b v ,B v ;R- l/2 (Hx- 



(5.4) ? PLQsubproblem ? 



If U w and U v are given as in (4.9 1, then the system (|4.6|) decomposes as 



= 

= 

= 

= 

= 

< 



A w u w + s 

T 



j — Ay U\> ~\~ S\; a^> 
= sjq v 
b w + B w Q~ l/2 Gd-M w u w -A w q w 
b v -B v R- l l 2 Hd-M v u v -A v q v 
G T Q- T ^ 2 Blu w -H T R- J / 2 Bj,u v 



(5.5) ? PLQFinalConditic 



See the Appendix for details on deriving the KKT system. By further exploiting the decomposition 
shown in fl5.1] ), we obtain the following theorem. 



Theorem 17 (PLQ Kalman smoother theorem) Suppose that all wt and vt in the Kalman smooth- 

{0} , Vk , 



ing model (5.1) come from PLQ densities that satisfy 
null(M;!')nnull((A^) T ) 



{0} ,null(M^')nnull((A^ T 

i.e. their corresponding penalties are finite -valued. Suppose further that the corresponding set 
from Theorem 14 is nonempty. Then (5.4) can be solved using an IP method, with computational 
complexity 0[N(n 3 +m 3 + /)], where I is the largest column dimension of the matrices {A^} and 



(5.6) ? NullPropertyKS ? 
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Figure 2: Simulation: measurements (•) with outliers plotted on axis limits (4 and —2), true function 
(continuous line), smoothed estimate using either the quadratic loss (dashed line, left 
panel) or the Vapnik's £-insensitive loss (dashed line, right panel) 



Note that the first part of this theorem, the solvability of the problem using IP methods, already fol- 
lows from Theorem 14 The main contribution of the result in the dynamical system context is the 
computational complexity. The proof is presented in the Appendix and shows that IP methods for 
solving ( |5.4[ ) preserve the key block tridiagonal structure of the standard smoother. If the number of 
IP iterations is fixed (10 — 20 are typically used in practice), general smoothing estimates can thus 
be computed in 0[N(n 3 +m 3 + /)] time. Notice also that the number of required operations scales 
linearly with /, which represents the complexity of the PLQ density encoding. 
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6. Numerical example 
6.1 Simulated data 

In this section we use a simulated example to test the computational scheme described in the previ- 
ous section. We consider the following function 

fit) = exp [sin(8f )] 



taken from (Dinuzzo et al. 2007 1. Our aim is to reconstruct / starting from 2000 noisy samples 
collected uniformly over the unit interval. The measurement noise v# was generated using a mixture 
of two Gaussian densities, with p = 0.1 denoting the fraction from each Gaussian; i.e., 



;i-p)N(0,0.25)+pN(0,25) 



Data are displayed as dots in Fig. [2] Note that the purpose of the second component of the Gaussian 
mixture is to simulate outliers in the output data and that all the measurements exceeding vertical 
axis limits are plotted on upper and lower axis limits (4 and -2) to improve readability. 
The initial condition f(0) = 1 is assumed to be known, while the difference of the unknown func- 
tion from the initial condition (i.e. /(•) — 1) is modeled as a Gaussian process given by an inte- 
grated Wiener process. This model captures the Bayesian interpretation of cubic smoothing splines 



(Wahba, 1990), and admits a two-dimensional state space representation where the first component 
of x(t), which models /(•) — 1, corresponds to the integral of the second state component, modelled 
as Brownian motion. To be more specific, letting At = 1/2000, the sampled version of the state 
space model (see ( Jazwinski) 1970[ Oksendal 2005 ) for details) is defined by 



G k 



1 

At 1 



k = 2,3,..., 2000 



H k =[0 1], k =1,2,... ,2000 



with the autocovariance of wu given by 



At 

AT 
2 



AT 
2 

3 



k = 1,2,..., 2000 



where A 2 is an unknown scale factor to be estimated from the data. 

We compare the performance of two Kalman smoothers. The first (classical) estimator uses a 
quadratic loss function to describe the negative log of the measurement noise density and con- 
tains only A 2 as unknown parameter. The second estimator is a Vapnik smoother relying on the e- 
insensitive loss, and so depends on two unknown parameters: A 2 and £. In both cases, the unknown 
parameters are estimated by means of a cross validation strategy where the 2000 measurements are 
randomly split into a training and a validation set of 1300 and 700 data points, respectively. The 
Vapnik smoother was implemented by exploiting the efficient computational strategy described in 
the previous section, see ( |Aravkin et al.| 201 lb ) for specific implementation details. In this way, 
for each value of A 2 and e contained in a 10 x 20 grid on [0.01, 10000] x [0, 1], with A 2 logarithmi- 
cally spaced, the function estimate was rapidly obtained by the new smoother applied to the training 
set. Then, the relative average prediction error on the validation set was computed, see Fig. [3] 
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Average prediction error on the validation set 




Figure 3: Estimation of the smoothing filter parameters using the Vapnik loss. Average prediction 
error on the validation data set as a function of the variance process A 2 and e. 



The parameters leading to the best prediction were A 2 = 2.15 x 10 3 and e = 0.45, which give a 
sparse solution defined by fewer than 400 support vectors. The value of A 2 for the classical Kalman 
smoother was then estimated following the same strategy described above. In contrast to the Vapnik 
penalty, the quadratic loss does not induce any sparsity, so that, in this case, the number of support 
vectors equals the size of the training set. 

The left and right panels of Fig. [2] display the function estimate obtained using the quadratic and 
the Vapnik losses, respectively. It is clear that the estimate obtained using the quadratic penalty is 
heavily affected by the outliers. In contrast, as expected, the estimate coming from the Vapnik based 
smoother performs well over the entire time period, and is virtually unaffected by the presence of 
large outliers. 



6.2 Real industrial data 



Let us now consider real industrial data coming from Syncrude Canada Ltd, also analyzed in Liu 



et al. (2004). Oil production data is typically a multivariate time series capturing variables such as 



flow rate, pressure, particle velocity, and other observables. Because the data is proprietary, the ex- 



act nature of the variables is not known. The data from Liu et al. ( 2004 ) comprises two anonymized 
time series variables, called V14 and V36, that have been selected from the process data. Each time 
series consists of 936 measurements, collected at times [1,2, .. . ,936] (see the top panels of Fig. |4]). 
Due to the nature of production data, we hypothesize that the temporal profile of the variables is 
smooth and that the observations contain outliers, as suggested by the fact that some observations 
differ markedly from their neighbors, especially in the case of V14. 

Our aim is to compare the prediction performance of two smoothers that rely on £2 and i\ measure- 
ment loss functions. For this purpose, we consider 100 Monte Carlo runs. During each run, data are 
randomly divided into three disjoint sets: training and a validation data sets, both of size 350, and 
a test set of size 236. We use the same state space model adopted in the previous subsection, with 
At = 1 , and use a non-informative prior to model the initial condition of the system. The regulariza- 
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Figure 4: Left panels: data set for variable 14 (top) and relative percentage errors in the recon- 
struction of the test set obtained by Kalman smoothers based on the £2 and the £\ loss 
(bottom). Right panels: data set for variable 36 (top) and relative percentage errors in the 
reconstruction of the test set obtained by Kalman smoothers based on the £2 and the l\ 
loss (bottom). 



tion parameter 7 (equal to the inverse of A 2 assuming that the noise variance is 1) is chosen using 
standard cross validation techniques. For each value of 7, logarithmically spaced between 0.1 and 
1000 (30 point grid), the smoothers are trained on the training set, and the 7 chosen corresponds to 
the smoother that achieves the best prediction on the validation set. After estimating 7, the variable's 
profile is reconstructed for the entire time series (at all times [1,2,..., 936]), using the measurements 
contained in the union of the training and the validation data sets. Then, the prediction capability 
of the smoothers is evaluated by computing the 236 relative percentage errors (ratio of residual and 
observation times 100) in the reconstruction of the test set. 

In Fig. [4] we display the boxplots of the overall 23600 relative errors stored after the 100 runs for 
V14 (bottom left panel) and V36 (bottom right panel). One can see that the l\ -Kalman smoother 
outperforms the classical one, especially in case of V14. This is not surprising, since in this case 
prediction is more difficult due to the larger numbers of outliers in the time series. In particular, for 
V14, the average percentage errors are 1.4% and 2.4% while, for V36, they are 1% and 1.2% using 
£\ and £2, respectively. 
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7. Conclusions 



We have presented a new theory for robust and sparse estimation using nonsmooth QS penalties. We 
give both primal and dual representations for these densities and show how to obtain closed form 
expressions using Euclidean projections. Using their dual representation, we first derived conditions 
allowing the interpretation of QS penalties as negative logs of true probability densities, thus estab- 
lishing a statistical modeling framework. In this regard, the coercivity condition characterized in 
Th. [TOjplayed a central role. This condition, necessary for the statistical interpretation, underscores 
the importance of an idea already useful in machine learning. Specifically, coercivity of the objec- 
tive ( 1 .4 ) is a fundamental prerequisite in sparse and robust estimation, as it precludes directions 
for which the sum of the loss and the regularizer are insensitive to large parameter changes. Thus, 
the condition for a QS penalty to be a negative log of a true density also ensures that the problem 
is well posed in the machine learning context, i.e. the learning machine has enough control over 
model complexity. 

The QS class captures a variety of existing penalties used for both as misfit measures and as regular- 
ization functionals. We have also shown how to construct natural generalizations of these penalties 
within the QS class that are based on general norm and cone geometries. Moreover, we show how 
the structure of these functions can be understood through the use of Euclidean projections. More- 
over, it is straightforward to use the presented results to design new formulations. Specifically, 
starting with the requisite shape of a new penalty, one can use results of Section [3] to obtain a stan- 
dardized corresponding density, as well as the data structures U,M,B,b required to formulate and 
solve the optimization problem in Section [4] The statistical interpretation for these methods allows 
us to prescribe the mean and variance parameters of the corresponding model. 
In the second part of the paper, we presented a broad computational approach to solving estimation 
problems (1.4i using interior point methods. In the process, we derived additional conditions that 
guarantee the successful implementation of IP methods to compute the estimator ( 1.4 1 when x and 
v come from PLQ densities (a broad subclass of QS penalties), and provided a theorem character- 
izing the convergence of IP methods for this class. The key condition required for the successful 
execution of IP iterations was a requirement on PLQ penalties to be finite valued, which implies 
non-degeneracy of the corresponding statistical distribution (the support cannot be contained in a 
lower-dimensional subspace). The statistical interpretation is thus strongly linked to the computa- 
tional procedure. 

We applied both the statistical framework and the computational approach to the broad class of state 
estimation problems in discrete-time dynamic systems, extending the classical formulations to al- 
low dynamics and measurement noise to come from any PLQ densities. Moreover, we showed that 
the classical computational efficiency results can be preserved when the general IP approach is used 
in the state estimation context; specifically, PLQ Kalman smoothing can always be performed with 
a number of operations that is linear in the length of the time series, as in the quadratic case. The 
computational framework presented therefore allows the broad application of interior point meth- 
ods to a wide class of smoothing problems of interest to practitioners. The powerful algorithmic 
scheme designed here, together with the breadth and significance of the new statistical framework 
presented, underscores the practical utility and flexibility of this approach. We believe that this per- 
spective on modeling, robust/sparse estimation and Kalman smoothing will be useful in a number 
of applications in the years ahead. 

In many contexts, it would be useful to estimate the parameters that define QS penalties; for exam- 
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pie the K in the Huber penalty or the e in the Vapnik penalty. In the numerical examples presented 
in this paper, we have relied on cross-validation to accomplish this task. An alternative method 
could be to compute the MAP points returned by our estimator for different filter parameters to gain 
information about the joint posterior of states and parameters. This strategy could help in designing 



a good proposal density for posterior simulation using e.g. particle smoothing filters (Ristic et al. 



2004 ). We leave a detailed study of this approach to the QS modeling framework for future work. 



8. Appendix 

8.1 Proof of Theorem H 

Let p(y) = p(U,M,I,0;y) so that p(U ,M,B,b;y) = p(b + By). Then dom (p(U,M,B,b; •)) = 
B 1 (dom(p) — b), hence the theorem follows if it can be shown that bar(£/) + Ran(M) C dom(p) C 
[U°° n null(M)]° with equality when bar(t/) + Ran (M) is closed. Observe that if there exists w G 
U°° nnull(M) such that (y, w) > 0, then trivially p(y) = +oo soy ^ dom(p). Consequently, dom(p) C 
[U°° nnull(M)]°. Next let y G bar(t/) +Ran(M), then there is a v G bar(t/) and w such that 
y = v +Mw. Hence 



sup[(«, y) 



(u, Mu)} 



sup[(w, v+Mw) 

ueU 



(u,Mu)] 



= sup[(w, v) + \w Mw- \{w-u) M(w-u)] 
ueU 

< 8*(v\U) + kw T Mw < oo . 



Hence bar([/) +Ran (M) Cdom(p). 



If the set bar(U) +Ran(M) is closed, then so is the set bar(£/). Therefore, by (Rockafellar 



1970, Corollary 14.2.1), ([/°°)° = bar(I7), and, by ( |Rockafellar[ [1970] Corollary 16.4.2), [U°° n 
null(M)]° = bar(U ) + Ran (M), which proves the result. 

The polyhedral case bar(£/) is a polyhedral convex set, and the sum of such sets is also a poly- 
hedral convex set ( [Rockafellar} 1 1970[ Corollary 19.3.2). 



8.2 Proof of Theorem H 

To see the first equation in ( |2.4[ ) write p(y) = sup H [(y, u) — (5 ||L r M||2 + 5 (u \ U ))] , and then apply 
the calculus of convex conjugate functions ( Rockafellar) 1 1970] Section 16) to find that 



( 



i||L 7 



: + d(-\U))\y)= MMs\\ 2 2 + d*(y-Ls\U: 



The second equivalence in ( |2.4[ ) follows from (Rockafellar 1970 Theorem 14.5). 

For the remainder, we assume that M is positive definite. In this case it is easily shown that 
(MU)° =M- l U°. Hence, by dRockafellarll 19701 Theorem 14.5), y(-\MU) = 5* (• \M- l U°). We 



use these facts freely throughout the proof. 
The formula (|2.5|) follows by observing that 



ilkll 2 



- 1 u*ui + 8*(y-Ls\U) = l\\L- T s\\t t + 8* 



(M- l y- 



-L- T s\MU* 



and then making the substitution v = L T s. To see (|2.6|>, note that the optimality conditions for (|2.5|> 



are Ms G dd* (M ^-s\ MU ) , or equivalently, M l y-s eN (Ms \ MU ), i.e. s G U and 



(M 



-l 



y — s, u — s 



M 



{M- l y-s,M(u-s)) <0V ueU, 
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which, by ( [23) , tells us that s = P M (M~ ^{U). Plugging this into ( [23] ) gives < |2T6 



Using the substitution v = Li, the argument showing ( |2.7[ ) and ( |2.8[ ) differs only slightly from 
that for (12. 5b and (|2.5|) and so is omitted. 



The formula (2.9 1 follows by completing the square in the M-norm in the definition (2.1 



(y, u) — j (w, Mm) 



(M" V, «) 



M 



2 ( M ' M )m 

|y r M- x y - i [<M-V, ilfV)* " 2 (M^y, M ) m + (u, u) M ] 
\y T M- l y-\\\M- l y-u\\ 2 M . 



The result as well as ( 2.10 i now follow from Theorem [6] Both ( 2.11 ) and ( 2.12 1 follow similary by 
completing the square in the M _1 -norm. 

8.3 Proof of Theorem [9] 

First we will show that if p is convex coercive, then for any x G argmin/ ^ 0, there exist constants 
R and K > such that 

p{x) > p(x)+K\\x-x\\ Vx^tfB. (8. 1) ?boundRho? 

Without loss of generality, we can assume that = p (0) = inf p . Otherwise, replace p (x) by 
p (x) = p (x + x) — p (x) , where x is any global minimizer of p . 

Let a > 0. Since p is coercive, there exists R such that lev p (ct) C /?B. We will show that 
|||x|| <p(jt)foralU^m 

Indeed, for all x / 0, we have p(prrx) > a. Therefore, if x £ RE, then < < 1, and we have 



a.. ,. \\x\\ 

— be < Ii — li -p 



« be 



Then by dO), 



/ exp(— p(x))cix = / exp(— p(x))Jx+ / exp(— p(x))cf. 

J Jjc+RB J|Lv-,r||>R 



<Ci+C 2 



||jc-x||>R 



exp(— ^||x — x||)<ix < oo . 



\v?Mu 



8.4 Proof of Theorem [H 

First observe that B- 1 [cone(J7)]° = [fi T cone(£/)]° by ( |Rockafellar[|l970[ Corollary 16.3.2). 

Suppose that y G B 1 [cone(£/)]°, andy / 0. Then By G cone(£/), and By^O since B is injective, 
and we have 

p(ty) = sup ueU {b+tBy,u)- 

= sup ueU {b,u) — ^u T Mu + t(By,u) 

< sup ueU (b,u) — ^u T Mu 

< p(U,M,0,I;b), 

so p (ty) stays bounded even as t — > oo, and so p cannot be coercive. 

Conversely, suppose that p is not coercive. Then we can find a sequence {yk} with \\ykW > k 
and a constant P so that p(y&) < P for all /c > 0. Without loss of generality, we may assume that 

ft-* 
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Then by definition of p, we have for all u G U 

{b + By k ,u) - \u J Mu < P 
(b+By k ,u) <P+\u T Mu 



IWI ^ ihkW 



Note that y ^ 0, so By ^ 0. When we take the limit as k — > °°, we get (By,u) < 0. From this 
inequality we see that #y € [cone(f/)]°, and so y G 5 1 [cone(f/)]°. 

8.5 Proof of Theorem [H 

Proof (i) Using standard elementary row operations, reduce the matrix 

/ 



,-(1) - 








to 



I 

D(s) 


.° 

where T = M J t-AD(q)D(s)~ l A T . The matrix T is invertible since null(M) nnul^C 1 
we can further reduce this matrix to the block upper triangular form 



A T 





D(s) 





—A -M 


B 


B 1 





A T 0" 




D(q)A T 




-T B 




B J 





(8.2) ? rKKTder 



{0}. Hence, 



1 





A T 








D(,) 


-D(^)C T 











-r 


B 











-B T T l B 



Since B is injective, the matrix B J T l B is also invertible. Hence this final block upper triangular is 
invertible proving Part (i). 

(ii) Let (s,q) G S , and choose so that (s,q,Ui,yi) G for / = 1,2. Set w := mi — «2 and 

y := yi —y2- Then, by definition, 

: A t m, = By-Mu, and0 = B T « . 







(8.3)?null sp 



Multiplying the second of these equations on the left by u and utilizing the third as well as the 
positive semi-definiteness of M, we find that Mu = 0. Hence, u G null(M) n null(A T ) = {0}, and so 
By = 0. But then y = as B is injective. 

(iii) Let (s,q,u,y) G and (s,q,u,y) G &+{?). Then, by ff6j, 



[(a-A T u) - (a-A T u)] T {q-q) 
(u-u) T (Aq-Aq) 

(u - u) T [(b + By- Mu) -(b + Bb- Mu)} 
(u — u) T M(u — u) 



> 0. 
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Hence, 

T + fq > s T y + s T q > s T y + y T s>t; \\(s,g)\\ 1} 
where % = min{s,-, qi\i = !,...,£}> 0. Therefore, the set 

<P+(z) = {(s,q) \{s,q,u,y)e^+(z)} 

is bounded. Now suppose the set ^+(t) is not bounded. Then there exits a sequence {(s v ,q v ,u v , y v )} C 
^+(t) such that ||( s v ,q v , u v ,y v ) || "f" +00. Since J^+(t) is bounded, we can assume that || (iiv,yv) || t 
+00 while ||(*V)9v)|| remains bounded. With no loss in generality, we may assume that there exits 
(u,y) 7^ (0,0) such that (u v ,y v )/ ||(w v ,y v ) [| — > (u,y). By dividing ( |4.6| ) by || (w v ,y v )[| and taking the 
limit, we find that ( |8.3| > holds. But then, as in ( |8.3[ ), (u,y) = (0,0). This contradiction yields the 
result. 

(iv) We first show existence. This follows from a standard continuation argument. Let (s,q,u,y) G 
and v G R £ ++ . Define 



F(s,q,u,y,t) 

where | := (jiji , . . . , f^) T . Note that 



s+A T u — a 
D(q)D(s)l-[(\-t)v + tv] 
By—Mu-Aq 
B T u + b 



(8.4) ?homotopy F? 



F(s,q,u,y,0) = and, by Part (i), V^^ U;y) F(s,q,u,y,0) 1 exists. 

The Implicit Function Theorem implies that there is a t > and a differentiable mapping t \-t 

(s(t),q(t),u(t),y(t)) on [0,t) such that 

F[s(t),q(t),u(t),y(t),*] = 0on M- 
Let t > be the largest such t on [0, 1]. Since 

{[s(t),q(t),u(t),y(t)] \t G [0,T)} C 

where f = max{l T g, l T g}, Part (iii) implies that there is a sequence f,- — > F and a point (s,q,u,y) 
such that 9(^1), «(^')>y( f i)] ~~ ^ (*j9>">y)- By continuity F(s,g,w,;y,F) =0. If F = 1, we are done; 
otherwise, apply the Implicit Function Theorem again at (s,q,u,y,t) to obtain a contradiction to the 
maximality of F. 

We now show uniqueness. By Part (ii), we need only establish the uniqueness of (s,q). Let 
(s v ,q v ) G be such that g = (sjr^qjn^Sjmqjm,- ■ ■ , s j(t\ ( lj(e))' 1 > where ^ya denotes the /th ele- 
ment of jy, and j = 1,2. As in Part (iii), we have (s\ — s 2 ) T (qi — q-i) = (u\ — u 2 ) T M((u\ — « 2 ) > 0, 
and, for each i = !,...,£, su^qu^ = *2(i)92(/) = Si > 0- If ( J i><7l) / (^2,^2), then, for some 
...,£}, {s l{i) - s 2{i) )(q 1{i) - q 2{ i ) ) > and either / s 2{i) or / < ?2( ,- ) . If > s 2 (;), 
then > ^2(0 > so that gi = sunqun > s 2 ^q 2 ^ = gi, a contradiction. So with out loss in 
generality (by exchanging (s\,q\) with (52,^2) if necessary), we must have qu* > q 2 ^y But then 
s i(i) > s 2(i) > 0, so tnat again gi = suAqun > *2(i)?2(i) = gi> and again a contradiction. Therefore, 
(s,q) is unique. 
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(v) Apply Part (iv) to get a point on the central path and then use the continuation argument to trace 
out the central path. The differentiability follows from the implicit function theorem. 

(vi) Part (iii) allows us to apply a standard compactness argument to get the existence of cluster 
points and the continuity of Fy(s,q,u,y) in all of its arguments including / implies that all of these 
cluster points solve (|4.6|). ■ 



8.6 Details for Remark [16] 



The Lagrangian for (5.4i for feasible (x,u w ,u v ) is 







u w 




u w 


T 


~M W 


" 




u w 




u w 




-B W Q l / 2 G 


h. 


5 


u v 


H 


u v 







M v _ 




u v 


-< 


u v 




B V R l l 2 H 



(8.5) ?PLQLagrangian? 



where b w = b w — B W Q x ' 2 xq and b v = b v — B V R l l 2 z- The associated optimality conditions for 
feasible (x, u w ,u v ) are given by 

G T Q- T / 2 Blu w -H J R- T / 2 Bj,u v = 

b w - M W U W + B W Q l l 2 Gx G N Uw (U w ) (8.6) ? PLQOptimalityCor 

b v -M v u v - B V R l l 2 Hx G N Uv {u v ) , 

where Nc(r) denotes the normal cone to the set C at the point r (see (Rockafellar 1970) ) for details). 

Since U w and U v are polyhedral, we can derive explicit representations of the normal cones 
Nu w {u w ) and Nu v {u v ). For a polyhedral set U C W" and any point u G U, the normal cone Nu(u) is 
polyhedral. Indeed, relative to any representation 

U = {u\A T u < a} 

and the active index set I{u) := {/| (A,-, u) = af\, where A denotes the ith column of A, we have 

\ <?iAi H h q m A m \qi>0 for i G I{u) ) 

I qi = 0for/ ^I(u)\ 

Using (&J_), Then we may rewrite the optimality conditions ( |8.6| ) more explicitly as 

G T Q- T/2 Blu w -H T R- J / 2 Blu v = 
b w - M w u w + B w Q~ l/2 Gd = A w q w 

b v - M V U V - B v R- l/2 Hd = A v q v (8-8) ? PLQExpandedCondi 

{q v > 0|<7 V ( () = for i 1(u v )} 
{q w > 0\q w (i) = for /G7(w w )} 



Nu(u) 



(8.7) ?NormalRep? 



where q v n\ and q w n\ denote the ith elements of q v and q w . Define slack variables s w > and s v > 
as follows: 

s w = a w -A?u w (8.9)?siack? 

Sy (Xy V V* 

Note that we know the entries of q w u\ and q v n\ are zero if and only if the corresponding slack 
variables s v u\ and s w n) are nonzero, respectively. Then we have qj v s w = qj,s v = 0. These equations 
are known as the complementarity conditions. Together, all of these equations give system (|5.5 I. 
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8.7 Proof of Theorem Q3 



IP methods apply a damped Newton iteration to rind the solution of the relaxed KKT system Fy = 0, 
where 



(s w \ 




A w u w -\-s w ct w 


s v 






q w 




D(q w )D{s w )l-yl 


q v 




D(q v )D(s v )l-Yl 


u w 




b w +B w Q- l l 2 Gd-M w u w -A w q 


u v 




b v - B v R~ l l 2 Hd - M v u v - A v q v 


\x) 




_ G T Q- T l 2 Blu w -H T R- T l 2 Blu v 



This entails solving the system 



(s w \ 




~As w ~ 




fs w \ 


Sv 




As v 




s v 


q w 




Aq w 




q w 


q v 




Aq v 


= -Fy 


q v 


u w 




Au w 




u w 


u v 




Au v 




Uy 


\x) 




. A* . 




\x) 



(8.10) ? NewtonSystem ? 



where the derivative matrix Fy is given by 



" / 











{Aw) T 













/ 











(Av) T 







D(q w ) 





D(s w ) 



















D(q v ) 





D(s v ) 











(8.11)?Fprime? 








Aw 





-M w 





B W Q l/2 G 













-Ay 





-M v 


—B v R l l 2 H 
















G^Q-y 2 Bi 


-H T R- T / 2 Bj, 








,(1) 



We now show the row operations necessary to reduce the matrix Fy in (8.11|> to upper block 
triangular form. After each operation, we show only the row that was modified. 

row3 <— row3 —D(q w ) rowi 

[0 D(s w ) -D(q w )Al 0] 
row4 <— row 4 —D(q v ) row 2 

[0 D{s v ) -D{q v )Al 0] 
rows <r- rows +A w D(s w )~ l row3 

[0 -T w B w Q- l l 2 G] 
row6 <— row6 +A v D(s v )~ l rov/4 

[0 -T y -B V R l / 2 H] . 



In the above expressions, 



T 

1 w 



M w +A w D(s v 



D(q w )Al 



■ M v +A v D(s v )- l D(q v )A v , 



(8.12) ? KalmanT ? 
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where D(s w ) D(q w ) and D(s v ) D(q v ) are always full-rank diagonal matrices, since the vectors 
s w ,q w ,s v ,q v . Matrices T w and T v are invertible as long as the PLQ densities for w and v satisfy ( |4. 10 1. 



Remark 18 (block diagonal structure of T in i.d. case) Suppose that y is a random vector, y = 
vecd)^}), where each y\ is itself a random vector in W n ( l \from some PLQ density 
p(y ; -) exp[— C2p(Ui,Mj,0,I; •)], and all y* are independent. Let {/,• = {m : A^w < a,}. 77je?i f/je 
matrix T p is given by T p = M + ADA T where M = diag[Mi , • • • ,M N ], A = diag[Ai , • • • ,A N ], D = 
diag[Di , • • • ,Dn], and {D;} are diagonal with positive entries. Moreover, T p is block diagonal, with 
ith diagonal block given by Mi +AjD{Aj. 



From Remark 18 the matrices T w and T v in (8.12 1 are block diagonal provided that {wt} and {v{\ 
are independent vectors from any PLQ densities. 



We now finish the reduction of Fy~^ to upper block triangular form: 



row? <- row? + (g 1 Q 1 '^.r^ 1 ) row 5 



-T/2 



row6 



7 











(A W ) J 











/ 











(Av) T 











S w 






















s v 






















-T w 





B W Q l/2 G 

















-Tv 


—B v R l l 2 H 























where 

a = ci G +a H = g t q- t/2 bIt- 1 b w q-^ 2 g+h t r- t / 2 bJ,t- [ b v r-^ 2 h. 

Note that Q. is symmetric positive definite. Note also that £2 is block tridiagonal, since 

1 . Qh is block diagonal. 

2. Q~' I ' 2 B'^ I T~ 1 B W Q~' 1 ' 2 is block diagonal, and G is block bidiagonal, hence £2g is block tridi- 
agonal. 



(8.13)?OmegaMatr 



Solving system (8. 10 1 requires inverting the block diagonal matrices T v and T w at each iteration of 
the damped Newton's method, as well as solving an equation of the form Q.Ax = p. The matrices 
T v and T w are block diagonal, with sizes Nn and Nm, assuming m measurements at each time point. 
Given that they are invertible (see ( |4. 10| >), these inversions take 0(Nn 3 ) and 0(Nm 3 ) time. Since Q. 
is block tridiagonal, symmetric, and positive definite, Q.Ax = p can be solved in 0(Nn 3 ) time using 



the block tridiagonal algorithm in ( |Bell 2000 1. The remaining four back solves required to solve 



(8. 10 1 can each be done in 0(NI) time, where we assume that A v ha G M and A 



w(k) 



wnx/ 



at each 



time point k. 
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